Abstract. In many networks, vertices have hidden attributes, or types,
that are correlated with the network's topology. If the topology is known
but these attributes are not, and if learning the attributes is costly, we
need a method for choosing which vertex to query in order to learn as
much as possible about the attributes of the other vertices. We assume
the network is generated by a stochastic block model, but we make no
assumptions about its assortativity or disassortativity. We choose which
vertex to query using two methods: 1) maximizing the mutual information
between its attributes and those of the others (a well-known approach in
active learning) and 2) maximizing the average agreement between two
independent samples of the conditional Gibbs distribution. Experimental
results show that both these methods do much better than simple
heuristics. They also consistently identify certain vertices as important
by querying them early on.

1 Introduction

Suppose we have a network, represented by a graph $G = (V, E)$ with $n$
vertices. Suppose further that each vertex $v$ has a type
$t(v) \in \{1, \ldots, k\}$, representing the value of some hidden attribute
that takes $k$ different values. We are given the graph $G$, and our goal is
to learn the types $t(v)$. One way we might do this is to assume that $G$ is
generated by some probabilistic model, in which its topology is correlated
with these types.

The simplest such model, although by no means the only one to which our
methods could be applied, is a stochastic block model. Here we assume that
each pair of vertices $u, v$ has an edge between them with probability
$p_{t(u),t(v)}$, and that these events are independent. Given an assignment
$t : V \to \{1, \ldots, k\}$ of types to vertices, and a $k \times k$ matrix
of probabilities $p_{ij}$, the likelihood of



2 Xiaoran Yan, Yaojia Zhu et al. 

generating $G$ in this model is

$$ \mathcal{L}(G \mid t, p) = \prod_{i,j=1}^{k} p_{ij}^{e_{ij}} (1 - p_{ij})^{n_i n_j - e_{ij}} , \tag{1} $$

where $n_i = |\{v \in V : t(v) = i\}|$ is the number of vertices of type $i$,
and $e_{ij} = |\{(u,v) \in E : t(u) = i, t(v) = j\}|$ is the number of edges
from vertices of type $i$ to vertices of type $j$. Note that (1) assumes that
edges are directed, and allows self-loops. We can disallow self-loops, or make
the edges undirected, by replacing $n_i^2$ with $n_i(n_i - 1)$ or
$\binom{n_i}{2}$ respectively, and/or taking the product over pairs of types
$i, j$ with $i \le j$.
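The counts $n_i$ and $e_{ij}$ and the likelihood (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `block_counts` and `log_likelihood` are ours, types are numbered 0..k-1 rather than 1..k, edges are directed pairs, and self-loops are allowed as in (1).

```python
import math

def block_counts(edges, t, k):
    """Return (n, e): n[i] is the number of vertices of type i, and
    e[i][j] is the number of directed edges from type i to type j.
    `t` maps each vertex to a type in 0..k-1 (the text uses 1..k)."""
    n = [0] * k
    for vertex_type in t.values():
        n[vertex_type] += 1
    e = [[0] * k for _ in range(k)]
    for u, v in edges:
        e[t[u]][t[v]] += 1
    return n, e

def log_likelihood(edges, t, p):
    """Logarithm of (1) for a k x k matrix p of edge probabilities.
    Assumes 0 < p[i][j] < 1 wherever the corresponding term is needed."""
    k = len(p)
    n, e = block_counts(edges, t, k)
    total = 0.0
    for i in range(k):
        for j in range(k):
            non_edges = n[i] * n[j] - e[i][j]
            if e[i][j] > 0:
                total += e[i][j] * math.log(p[i][j])
            if non_edges > 0:
                total += non_edges * math.log(1.0 - p[i][j])
    return total
```

Working with logarithms avoids underflow, since the product in (1) has one factor per ordered pair of vertices.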

We do not assume that $p_{ij}$ takes one value when $i = j$ and a smaller
value when $i \ne j$. In other words, we do not assume an assortative
community structure, where vertices are more likely to be connected to other
vertices of the same type. Nor do we require that $p_{ij} = p_{ji}$, since the
directed nature of the edges may be important. For example, herbivores eat
plants, but the reverse is usually not the case. This kind of stochastic block
model is well-known in the sociology and machine learning communities
(e.g. [20,18,1,10]) and has also been used in ecology to identify groups of
species in food webs.

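Since we are interested in finding the labels $t$ of the nodes, we integrate
over the parameters $p_{ij}$ of the block model, in order to obtain the
likelihood of $G$ given $t$. If we assume a prior in which the $p_{ij}$ are
independent, this integral factorizes over the product (1). In particular, if
each $p_{ij}$ is chosen uniformly from $[0, 1]$, we have

$$
\begin{aligned}
\mathcal{L}(G \mid t) &= \iiint d\{p_{ij}\} \, \mathcal{L}(G \mid t, p) \\
&= \prod_{i,j=1}^{k} \int_0^1 dp_{ij} \, p_{ij}^{e_{ij}} (1 - p_{ij})^{n_i n_j - e_{ij}} \\
&= \prod_{i,j=1}^{k} \frac{1}{n_i n_j + 1} \binom{n_i n_j}{e_{ij}}^{-1} .
\end{aligned}
\tag{2}
$$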
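As a check on the last step of (2), the integral for a single pair of types can be worked out directly. Writing $N = n_i n_j$ and $e = e_{ij}$ (shorthand introduced here only for this step), it is a standard Beta integral:

$$ \int_0^1 p^{e} (1-p)^{N-e} \, dp = B(e+1, N-e+1) = \frac{e! \, (N-e)!}{(N+1)!} = \frac{1}{N+1} \binom{N}{e}^{-1} . $$

For example, with $N = 4$ and $e = 1$ this gives $1! \, 3! / 5! = 1/20 = \frac{1}{5} \binom{4}{1}^{-1}$.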

Of course, we could easily assume some other prior on $[0, 1]$ for the
$p_{ij}$, such as a beta distribution, and then optimize its parameters, but
here we will stick to (2) for its simplicity. If we assume a uniform prior
over the assignments $t$, then Bayes' rule gives them a Gibbs distribution

$$ P(t) = P(t \mid G) \propto \mathcal{L}(G \mid t) . \tag{3} $$

Note that (2) is maximized when, for each pair of types, $e_{ij}$ is close to
$0$ or to $n_i n_j$. In other words, the most likely assignments are those
where, for each pair of types $i, j$, pairs of vertices of types $i$ and $j$
are either mostly connected or mostly unconnected.
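The integrated likelihood, which determines the Gibbs distribution up to normalization, is easy to evaluate numerically. The sketch below is illustrative only: the names `log_binom` and `log_integrated_likelihood` are ours, types are numbered 0..k-1, and edges are directed with self-loops allowed.

```python
import math

def log_binom(n, m):
    """Logarithm of the binomial coefficient C(n, m), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(m + 1) - math.lgamma(n - m + 1)

def log_integrated_likelihood(edges, t, k):
    """Logarithm of the likelihood of G given t, with each p_ij
    integrated out under a uniform prior on [0, 1]."""
    n = [0] * k
    for vertex_type in t.values():
        n[vertex_type] += 1
    e = [[0] * k for _ in range(k)]
    for u, v in edges:
        e[t[u]][t[v]] += 1
    total = 0.0
    for i in range(k):
        for j in range(k):
            pairs = n[i] * n[j]
            # Each pair of types contributes 1 / ((pairs + 1) * C(pairs, e_ij)).
            total -= math.log(pairs + 1) + log_binom(pairs, e[i][j])
    return total
```

An empty type contributes nothing, since then `pairs` is 0 and both logarithms vanish. The value is largest when each `e[i][j]` is near 0 or near `pairs`, where the binomial coefficient is smallest.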

An alternate approach is to assume that the $p_{ij}$ take their maximum
likelihood values

$$ \hat{p}_{ij} = \operatorname{argmax}_p \mathcal{L}(G \mid t, p) = e_{ij} / (n_i n_j) , \tag{4} $$

and set $\mathcal{L}(G \mid t) = \mathcal{L}(G \mid t, \hat{p})$. This approach
was used, for instance, for a hierarchical block model in [6]. When $k$ is
fixed and the $n_i$ are large, this will give results similar to (2), since
the integral over $p$ is tightly peaked around $\hat{p}$. However, for any
particular finite graph it makes more sense, at least to a Bayesian, to
integrate over the $p_{ij}$, since they obey a posterior distribution rather
than taking a fixed value. Moreover, averaging over the parameters as in (2)
discourages over-fitting, since the average likelihood goes down when we
increase $k$ and hence the volume of the parameter space. This should allow us
to determine $k$ automatically, although in this paper we set $k$ by hand.

We emphasize, however, that the approaches to active learning we discuss 
below are not tied to this particular type of block model. They can be adapted 
to a wide range of other probabilistic models in which topology is correlated 
with hidden attributes of the vertices. 

We note that Bilgic and Getoor have discussed ways to use network
relationships to improve active learning about vertices [3], and that Hanneke
and Xing [9] have studied active learning for learning network topology. In
contrast to [9], we assume that the network topology is known, but that the
types of the vertices are not.



2 Active Learning 

In the active learning setting, the algorithm can learn the type of any given 
vertex, but at a cost — say, by devoting resources in the laboratory or the field. 
Since these resources are limited, it has to decide which vertex to query. Its goal 
is to query a small set of vertices, and use their types to make good guesses 
about the types of the remaining vertices. 

One natural approach (see, e.g., MacKay [12] or Guo and Greiner [8]) is to
query the vertex $v$ with the largest mutual information (MI) between its type
$t(v)$ and the types $t(G \setminus v)$ of the other vertices. We can write
this as the difference between the entropy of $t(G \setminus v)$ and its
conditional entropy given $t(v)$,

$$ \mathrm{MI}(v) = I(v; G \setminus v) = H(G \setminus v) - H(G \setminus v \mid v) . \tag{5} $$

Here $H(G \setminus v \mid v)$ is the entropy, averaged over $t(v)$ according
to the marginal of $t(v)$ in the Gibbs distribution, of the joint distribution
of $t(G \setminus v)$ conditioned on $t(v)$. In other words, MI(v) is the
expected decrease in the entropy of $t(G \setminus v)$ that will result from
learning $t(v)$. Since the mutual information is symmetric, we also have

$$ \mathrm{MI}(v) = I(v; G \setminus v) = H(v) - H(v \mid G \setminus v) , \tag{6} $$

where $H(v)$ is the entropy of the marginal distribution of $t(v)$, and
$H(v \mid G \setminus v)$ is the entropy, on average, of the distribution of
$t(v)$ conditioned on the types of the other vertices. Thus a good vertex to
query is one about which we are quite uncertain, so that $H(v)$ is large, but
which is strongly correlated with other vertices, so that
$H(v \mid G \setminus v)$ is small.

We estimate these entropies by sampling from the space of assignments
according to the Gibbs distribution. Specifically, we use a single-site
heat-bath Markov chain. At each step, it chooses a vertex $v$ uniformly from
among the unqueried vertices, and chooses $t(v)$ according to the conditional
distribution proportional to $\mathcal{L}(G \mid t)$, assuming that the types
of all other vertices stay fixed. In addition to exploring the space, this
allows us to collect a sample of the conditional distribution of the chosen
vertex $v$ and its entropy. Since $H(v \mid G \setminus v)$ is the average of
the conditional entropy, and since $H(v)$ is the entropy of the average
conditional distribution, we can write

$$ I(v; G \setminus v) = -\sum_i \bar{P}_i \ln \bar{P}_i + \overline{\sum_i P_i \ln P_i} , \tag{7} $$

where $P_i$ is the probability that $t(v) = i$ in a sampled conditional
distribution, and overlines denote averages over the samples.
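The estimate in (7) is the entropy of the averaged conditional distribution minus the average of the conditional entropies. A minimal sketch, assuming the sampler has already collected a list of conditional distributions for a vertex (the names `entropy` and `estimate_mi` are ours):

```python
import math

def entropy(dist):
    """Shannon entropy in nats, with 0 * ln 0 taken to be 0."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def estimate_mi(conditional_samples):
    """Estimate MI(v) from sampled conditional distributions of t(v).
    Each sample is a length-k list of probabilities summing to 1."""
    m = len(conditional_samples)
    k = len(conditional_samples[0])
    # Average conditional distribution: its entropy estimates H(v).
    mean = [sum(s[i] for s in conditional_samples) / m for i in range(k)]
    # Average conditional entropy estimates H(v | G \ v).
    mean_entropy = sum(entropy(s) for s in conditional_samples) / m
    return entropy(mean) - mean_entropy
```

If every sample is certain but the samples disagree, the estimate is large. If every sample is the same distribution, the estimate is zero, since the other vertices then carry no information about $t(v)$.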

We offer no theoretical guarantees about the mixing time of the heat-bath
Markov chain, and it is easy to see that there are families of graphs and
values of $k$ for which it grows exponentially with $n$. For instance, if $G$
is an Erdős-Rényi random graph $G(n, 1/2)$, in which each pair of vertices is
independently connected with probability $1/2$, and if $k = 2$, it takes
$2^{\Omega(n)}$ steps on average to switch from a state where most vertices
are of type 1 to one where most are of type 2, since the "bottleneck" states
where half the vertices are of each type have total probability
$2^{-\Omega(n)}$. However, for the real-world networks we have tried so far,
the Markov chain appears to converge to equilibrium, and give good estimates
of MI(v), in a reasonable amount of time. We also improve our estimates by
averaging over many runs, each one starting from an independently random
initial state.
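One step of the single-site heat-bath chain can be sketched as follows. This is an illustration, not the implementation used in the experiments: `log_weight` is a hypothetical callback returning the logarithm of the (unnormalized) Gibbs weight of an assignment, and the full weight is recomputed for every candidate type, which is simple but slow.

```python
import math
import random

def heat_bath_step(t, unqueried, k, log_weight, rng=random):
    """Resample the type of one uniformly chosen unqueried vertex.

    t          -- dict vertex -> type in 0..k-1, modified in place
    unqueried  -- list of vertices whose types are not yet known
    log_weight -- hypothetical callback: log of the Gibbs weight of t
    Returns the chosen vertex and its conditional distribution.
    """
    v = rng.choice(unqueried)
    logs = []
    for s in range(k):
        t[v] = s
        logs.append(log_weight(t))
    # Subtract the maximum before exponentiating to avoid underflow.
    top = max(logs)
    weights = [math.exp(x - top) for x in logs]
    z = sum(weights)
    conditional = [w / z for w in weights]
    t[v] = rng.choices(range(k), weights=conditional)[0]
    return v, conditional
```

The returned conditional distribution is exactly what the MI estimate needs, so each step both moves the chain and contributes a sample.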

To complete the description of the MI active learning algorithm, we say that 
it is in stage j if it has already queried j vertices. In that stage, it estimates 
MI(w) for each unqueried vertex w, using the Markov chain to sample from the 
Gibbs distribution conditioned on the types of the vertices queried so far. It then 
queries the vertex v with the largest MI. We provide it with t(v), and it moves 
on to the next stage. 
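The stage-by-stage procedure can be sketched as a simple loop. The names here are illustrative assumptions: `score` stands for any estimate of a vertex's value given the types learned so far (for example an MI estimate), and `true_type` stands for the costly query.

```python
def active_learning(vertices, true_type, score, max_queries=None):
    """Greedy active learning: in each stage, query the best-scoring vertex.

    true_type -- hypothetical callback performing the costly query
    score     -- hypothetical callback score(v, known) -> float
    Returns the query order and the dict of learned types.
    """
    known = {}
    order = []
    unqueried = list(vertices)
    budget = len(unqueried) if max_queries is None else max_queries
    while unqueried and len(order) < budget:
        # Stage j = len(order): score every unqueried vertex given what is known.
        best = max(unqueried, key=lambda v: score(v, known))
        known[best] = true_type(best)
        order.append(best)
        unqueried.remove(best)
    return order, known
```

In practice `score` would run the Markov chain conditioned on `known`, so it is by far the most expensive part of each stage.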

Another strategy is to query the vertex that maximizes another quantity,
which we call the average agreement (AA). Given two type assignments
$t_1, t_2$, define their agreement as the number of vertices on which they
agree,

$$ |t_1 \cap t_2| = |\{v : t_1(v) = t_2(v)\}| . \tag{8} $$

Since our goal is to label as many vertices correctly as possible, what we
would really like to maximize is the agreement between an assignment $t_1$,
drawn from the Gibbs distribution, and the correct assignment $t_2$. But since
we don't know $t_2$, the best we can do is assume that it is drawn from the
Gibbs distribution as well. If we think of $(t_1, t_2)$ as having a joint
distribution, then querying $v$ would project onto the part of this
distribution where $t_1(v) = t_2(v)$. So, we define AA(v) as the expected
agreement between two assignments $t_1, t_2$ drawn independently from the
Gibbs distribution, conditioned on the event that they
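agree at $v$. This gives us the following quantity:

$$ \mathrm{AA}(v) = \frac{\sum_{t_1, t_2 : t_1(v) = t_2(v)} P(t_1) P(t_2) \, |t_1 \cap t_2|}{\sum_{t_1, t_2 : t_1(v) = t_2(v)} P(t_1) P(t_2)} . \tag{9} $$
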
For instance, imagine that $n = 6$ and $k = 2$, that vertices 1, 2, and 3
always have the same type, and that this type and the types of vertices 4, 5,
and 6 are chosen uniformly and independently from $\{1, 2\}$. This gives 16
possible assignments, each of which appears with probability $1/16$. If $t_1$
and $t_2$ agree at 1, then they also agree at 2 and 3, and 4, 5, and 6 are
each in $t_1 \cap t_2$ with probability $1/2$. So, AA(1) = 3 + 3/2 = 9/2. On
the other hand, if $t_1$ and $t_2$ agree at 6, then each of the other 5
vertices is in $t_1 \cap t_2$ with probability $1/2$, so AA(6) = 1 + 5/2 = 7/2.
Thus we should query one of the first three vertices, because doing so will
tell us the types of two other vertices as well.
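The numbers in this example can be checked by brute force, enumerating all pairs of the 16 equally likely assignments. The helper name `exact_aa` is ours, and vertices 1 to 6 are stored at tuple positions 0 to 5.

```python
from fractions import Fraction
from itertools import product

def exact_aa(assignments, v):
    """Exact AA(v) when every assignment in the list is equally likely."""
    numerator = Fraction(0)
    denominator = 0
    for t1, t2 in product(assignments, repeat=2):
        if t1[v] == t2[v]:
            # Agreement: number of positions on which t1 and t2 coincide.
            numerator += sum(a == b for a, b in zip(t1, t2))
            denominator += 1
    return numerator / denominator

# Vertices 1, 2, 3 share one type s; vertices 4, 5, 6 are independent.
assignments = [(s, s, s, a, b, c) for s, a, b, c in product((1, 2), repeat=4)]

print(exact_aa(assignments, 0))  # vertex 1: 9/2
print(exact_aa(assignments, 5))  # vertex 6: 7/2
```

Exact enumeration is only feasible for toy cases like this one; it is meant as a sanity check on the definition, not as a practical method.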

We estimate AA(v) using the same heat-bath Gibbs sampler as for MI(v),
except that we draw pairs of assignments $(t_1, t_2)$ independently, by
starting the Markov chain at two independently chosen initial states. We then
estimate the numerator and denominator of (9) by averaging over these pairs,
giving the estimate

$$ \mathrm{AA}(v)_{\mathrm{est}} = \frac{\sum_{(t_1, t_2)} \delta(t_1(v), t_2(v)) \, |t_1 \cap t_2|}{\sum_{(t_1, t_2)} \delta(t_1(v), t_2(v))} , \tag{10} $$

where $\delta(i, j) = 1$ if $i = j$ and $0$ otherwise. We keep track of these
averages for each vertex $v$ as follows: each time we draw a pair
$(t_1, t_2)$, for each $v \in t_1 \cap t_2$, we increment the numerator and
the denominator of (10) by $|t_1 \cap t_2|$ and 1 respectively, and for
$v \notin t_1 \cap t_2$ we leave the numerator and denominator unchanged. This
gives an alternate algorithm for active learning, where in each stage we query
the vertex with the largest estimated AA.
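The bookkeeping for the AA estimate can be sketched as follows, assuming the sampled pairs are already available as dictionaries from vertex to type (the function name `estimate_aa` is ours):

```python
def estimate_aa(pairs, vertices):
    """Estimate AA(v) for every vertex from sampled pairs (t1, t2).

    Vertices on which no sampled pair agrees are omitted from the
    result, since their estimate would be 0/0.
    """
    numerator = {v: 0 for v in vertices}
    denominator = {v: 0 for v in vertices}
    for t1, t2 in pairs:
        agree = [v for v in vertices if t1[v] == t2[v]]
        for v in agree:
            # Each vertex in the agreement set gets credit for its size.
            numerator[v] += len(agree)
            denominator[v] += 1
    return {v: numerator[v] / denominator[v]
            for v in vertices if denominator[v] > 0}
```

One pass over the sampled pairs updates the estimates for all vertices at once, so the cost of AA is dominated by the sampling itself.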

We judge the performance of these algorithms by asking, at each stage and 
for each vertex, with what probability the Gibbs distribution assigns it the cor- 
rect type. We can then plot, as a function of the stage j, what fraction of the 
unqueried vertices are assigned the correct type with probability at least q, for 
various thresholds q. 
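This performance measure can be computed directly from samples of the Gibbs distribution. A minimal sketch, with function names of our own choosing and with the probability of the correct type estimated as the fraction of samples that assign it:

```python
def correct_probability(samples, truth):
    """For each vertex, the fraction of sampled assignments that give
    it its true type (an estimate of the Gibbs marginal)."""
    m = len(samples)
    return {v: sum(s[v] == truth[v] for s in samples) / m for v in truth}

def fraction_correct(prob_correct, queried, q):
    """Fraction of unqueried vertices assigned the correct type with
    probability at least q. Returns None if no vertex is unqueried."""
    remaining = [v for v in prob_correct if v not in queried]
    if not remaining:
        return None
    return sum(prob_correct[v] >= q for v in remaining) / len(remaining)
```

Because the denominator counts only unqueried vertices, this fraction can go down from one stage to the next even though more types are known.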

3 Results 

We tested the MI and AA algorithms on Zachary's Karate Club [21], shown in
Fig. 1. This is a social network consisting of 34 members of a karate club,
where edges represent friendships. The club split into two factions, one
centered around the instructor (vertex 1) and the other around the club
president (vertex 34), each of which formed their own club. The network is
highly assortative, with a high density of edges within each faction and a low
density of edges between them.

Fig. 1. Zachary's Karate Club. Vertices 1 and 34 are the instructor and
president, and their communities are indicated by diamonds and circles. Shaded
vertices are more peripheral, and have weaker ties to their communities.

[Figure 2: two panels; x axis: # of nodes queried]

Fig. 2. Results of the active learning algorithms on Zachary's Karate Club
network. In each stage we sample the Gibbs distribution using 100
independently chosen initial conditions, doing $2 \times 10^4$ steps of the
heat-bath Markov chain for each one, and computing averages using the last
$10^4$ steps. The y axis shows the fraction of vertices, other than those
queried so far, which are labeled correctly by the conditional Gibbs
distribution with probability at least q, for q = 0.1, 0.3, 0.5, 0.7, 0.9. The
x axis is cut off after 9 queries; Fig. 7 (left) has the complete 0.9 curves.
Left, we query the vertex with the largest mutual information (MI) between it
and the rest of the network. Right, we query the vertex with the largest
average agreement (AA) as defined in the text. After querying 4 or 5 vertices,
both methods assign the correct label to about 80% of the remaining vertices
with probability 0.9 or greater. The AA algorithm performs somewhat better,
with the accuracy quickly converging to 100% as it queries more vertices.

It is no surprise that, after querying just four or five vertices, both
algorithms succeed in correctly identifying the types of most of the remaining
vertices (i.e., to which faction they belong) with high accuracy. The AA
algorithm performs slightly better, achieving an accuracy close to 100% after
nine queries. These results are shown in Fig. 2. Of course, this network is
quite small, and there are many community structure algorithms that identify
the two factions with perfect or near-perfect accuracy; see e.g. [17,7] for
reviews.

Perhaps more interesting is the order in which these algorithms choose to 
query the vertices. In Fig. 3 we sort the vertices in order of the average stage
at which they are queried. Both algorithms start by querying the two central 
vertices, the instructor and the president. They then query vertices such as 3, 9, 
and 10, which lie at the boundary between the two communities. At that point, 
the algorithms "understand" that the network consists of two assortative com- 
munities, and the boundary between the communities is clear. The last vertices 
to be queried are those such as 2, 4, and 24, which lie deep inside their commu- 
nities, so that their types are not in doubt. It is not clear why the AA algorithm 
performs better, but from this small experiment, it seems that it places a lower 
priority on peripheral vertices, such as 25, than the MI algorithm does. 

We also examined a food web of 488 species in the Weddell Sea, in the
Antarctic [5,11]. This data set is very rich, but we focus on two particular
attributes: the feeding type, and the part of the environment, or habitat, in
which the species lives. The feeding type takes $k = 5$ values, namely
herbivorous, carnivorous, omnivorous, detritivorous, or a primary producer.
The habitat attribute takes $k = 5$ values, namely pelagic, benthic,
benthopelagic, demersal, and land-based.

We show results for both attributes in Fig. 4. For feeding type, after querying
half the vertices, both algorithms correctly label about 75% of the remaining 
vertices. For the habitat attribute, both algorithms are less accurate, although 
AA performs significantly better than MI. Note that the accuracy is measured 
as a fraction of the un-queried vertices. It can decrease, for instance, if we query 
"easy" vertices early on, so that "hard" vertices form a larger fraction of the 
remaining ones. 

Fig. 4 also shows that both algorithms arrive at a stage at which they are
either right most of the time, or wrong most of the time, about each of the 
remaining vertices. For the feeding type attribute, for instance, after the AA 
algorithm has queried 300 species, it labels 75% of the remaining vertices cor- 
rectly with probability 90% — but labels the other 25% correctly with probability 
less than 10%. In other words, it has a high degree of certainty about all the 
vertices, but is wrong about many of them. Its accuracy improves as it continues 
to query the vertices, but it doesn't achieve high accuracy on all the unqueried 
vertices until there are only about 60 of them left. For the habitat attribute, the 
MI algorithm gets a small fraction of the unqueried vertices wrong up until the 
very end of the learning process.

Fig. 3. The order in which the active learning algorithms query vertices in
Zachary's Karate Club network, averaged over 10 independent runs of each 
algorithm. Error bars show the standard deviation. Both algorithms start by 
querying vertices 1 and 34, which are central to their respective communities, 
and then query vertices at the boundary between the two communities.

Fig. 4. Results for the Weddell Sea food web, averaged over 10 runs of each
algorithm. In each stage we sample the Gibbs distribution using 100 indepen- 
dently chosen initial conditions, doing 5 x 10^ steps of the heat-bath Markov 
chain for each one, and computing averages using the last 2.5 x 10'' steps. The 
y axis shows the fraction of vertices, other than those queried so far, which are 
labeled correctly by the conditional Gibbs distribution with probability at least 
q, for q = 0.1, 0.3, 0.5, 0.7, 0.9. The x axis stops when there is only one unqueried
vertex left. We show results for two attributes: above, the feeding type of the 
species, and below, the habitat in which it lives. After querying about half the 
species, both algorithms get to a stage where every species is either labeled 
correctly with high probability, or incorrectly with high probability. In other 
words, the algorithm is confident, but wrong, about a significant fraction of the 
species. Most of these are species which are poorly modeled by the stochastic 
block model — that is, those which would be misclassified even if one knew the 
types of all the other species. Left column, MI; right column, AA.

Thus both these algorithms find some vertices easy to classify, but others
very hard. Delving into the data, we found that, to a large extent, the blame 
lies not with the learning algorithms themselves, but with the stochastic block 
model, and its ability to model the data given this particular attribute. For 
example, for the habitat attribute, these algorithms perform well on pelagic, 
demersal, and land-based species. But the benthic habitat, which is the largest 
and most diverse, includes species with many feeding types and trophic levels. 
These additional attributes have a large effect on the topology, but they are 
hidden from the block model in our experiments. 

As a result, more than half the benthic species are misclassified by the block 
model, in the sense that if we condition on the habitats of the other vertices, it 
believes the benthic species' most likely type is pelagic, benthopelagic, demersal, 
or land-based. Specifically, 212 of the 488 species are mislabeled by the most 
likely block model, 94% of them with confidence over 0.9, even when the habitats 
of all the other species are known. 

To draw an analogy, if a member of the karate club was good friends with 
the instructor, but joined the president's club because it was close to her favorite 
cafe — and if the block model did not have access to this information — we could 
not expect the learning algorithm to classify her correctly until it got around to 
querying her. Of course, we can also regard our algorithms' mistakes as evidence 
that these habitat types are not cut and dried. Biologists are well aware that 
there are "connector species" that connect one habitat to another, and belong 
to some extent to both. 

In order to confirm our hypothesis that it is the accuracy of the block model, 
as opposed to the performance of the learning algorithm, that causes some ver- 
tices to be misclassified, we modified the data set in an artificial way in order to 
make it consistent with the block model. Starting with the original data set, we 
iterated the following procedure: at each step, we assigned each species a new 
value of the habitat attribute, setting it equal to the most likely type according 
to the most likely block model, conditioned on the types of all other vertices. 
After 6 iterations of this process, changing the types of a total of 260 species, 
we reached a fixed point, where the type of each vertex is consistent with the 
block model's predictions. As we expected, and as shown in Fig. 5, our learning
algorithms perform perfectly on this modified data set, predicting the type of 
every species with accuracy over 90% after querying just 18% of them. 
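The relabeling procedure can be sketched as a fixed-point iteration. Several details here are assumptions rather than facts from the experiment: `most_likely_type` is a hypothetical callback returning the block model's most likely type for a vertex given the current types of all the others, all vertices are updated simultaneously in each iteration, and the total counts every change, so a vertex that changes twice is counted twice.

```python
def relabel_to_fixed_point(types, most_likely_type, max_iters=100):
    """Repeatedly replace every type by the model's most likely type
    until nothing changes.

    most_likely_type -- hypothetical callback (v, types) -> type
    Returns (final types, number of iterations that changed something,
    total number of type changes).
    """
    types = dict(types)
    total_changes = 0
    for iteration in range(1, max_iters + 1):
        # Synchronous update: all new types are computed from the old ones.
        new_types = {v: most_likely_type(v, types) for v in types}
        changed = sum(new_types[v] != types[v] for v in types)
        if changed == 0:
            return types, iteration - 1, total_changes
        types = new_types
        total_changes += changed
    return types, max_iters, total_changes
```

A synchronous update is not guaranteed to converge in general, which is why the sketch caps the number of iterations.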

Fig. 5. Performance of the learning algorithms after the habitat attribute
reassignment. With this new data set, which better matches our block model,
both learning algorithms achieve excellent performance. Left, MI; right, AA.

A direct interpretation of the query order on the Weddell Sea food web is
difficult due to the complexity of the network. However, the query orders for
the two different attributes have a lot in common, suggesting that they agree
to a large extent about the relative importance of the species. As shown in
Fig. 6, the query orders are positively correlated, with a Pearson coefficient
of 0.553. The two attributes have a low correlation to begin with, as feeding
types and habitats are close to orthogonal in ecosystems (species tend to fill
the niches in the food chain wherever available). They have an Adjusted Mutual
Information [19] of 0.357, which varies from 0 for a total lack of correlation
(conditioned on the number of species of each type) to 1 for an exact match.
As a result, we believe that the correlation between the query orders is
largely due to the common underlying topology and its effect on the learning
process.
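For reference, the Pearson coefficient between two query orders is the covariance of the mean query positions divided by the product of their standard deviations. A minimal sketch (the function name `pearson` is ours, and the inputs in the usage note are made up, not the experimental data):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.
    Assumes neither sequence is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Here `xs[i]` and `ys[i]` would be the mean stage at which species `i` is queried under each attribute; identical orders give a value of 1 and reversed orders give -1.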

4 Comparison with Simple Heuristics 

To put these results into perspective, we compared our active learning
algorithms with some simple heuristics. These include: 1) querying the vertex
with highest degree in the subgraph of unqueried vertices, 2) querying the
vertex with highest shortest path betweenness centrality [4,14] in the
subgraph of unqueried vertices, and 3) querying a vertex uniformly at random
from the unqueried ones. The first two heuristics are popular measures of
centrality, which are believed to reflect the varying importance of the
vertices in a network [15].
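The degree and random heuristics are simple enough to sketch directly. The representation (an undirected graph as a dict from vertex to a set of neighbors) and the function names are our assumptions, and ties in degree are broken by iteration order, which the text does not specify.

```python
import random

def highest_degree_unqueried(adj, queried):
    """Vertex of highest degree in the subgraph induced by the
    unqueried vertices. Returns None if every vertex is queried."""
    best, best_degree = None, -1
    for v, neighbors in adj.items():
        if v in queried:
            continue
        # Only edges to other unqueried vertices count.
        degree = sum(1 for u in neighbors if u not in queried)
        if degree > best_degree:
            best, best_degree = v, degree
    return best

def random_unqueried(adj, queried, rng=random):
    """Vertex chosen uniformly at random from the unqueried ones."""
    candidates = [v for v in adj if v not in queried]
    return rng.choice(candidates) if candidates else None
```

Betweenness is omitted here because it needs an all-pairs shortest-path computation, which is better taken from a graph library than written inline.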

We judge the performance of these heuristics using the same Gibbs sampling
process as for MI and AA. As Fig. 7 shows, on Zachary's Karate Club, although
Degree and Betweenness did reasonably well, none of the heuristics beat MI or
AA. For the Weddell Sea food web, however, the situation is more interesting.
Random's curve still resembles a straight line, but its early performance is
surprisingly good in comparison. We speculate that MI and AA need a burn-in
process in the early stages to achieve their full potential. Degree and
Betweenness, on the other hand, did poorly throughout the process. It turns
out some high degree/betweenness vertices are actually among the easiest to
predict when the rest of the graph is known. With some unpredictable low
degree/betweenness vertices left unqueried, their accuracy remained quite low
even when they had queried almost all the vertices.

5 Conclusion 



Fig. 6. The comparison between query orders for two different attributes on
the Weddell Sea food web. x axis: mean query order for the habitat attribute
over 10 runs; y axis: mean query order for the feeding type attribute over 10
runs. Data is from the same experiment shown in Fig. 4. The Pearson
coefficient between the query orders is 0.553, while the Adjusted Mutual
Information between the attributes is 0.357.

Fig. 7. A comparison of the MI and AA learning algorithms with three simple
heuristics. Left, Zachary's Karate Club at 0.9 accuracy threshold. Right,
Weddell Sea food web with the feeding type attribute at 0.9 accuracy
threshold. All data are collected using the same Gibbs sampling process as
specified in Fig. 2 and Fig. 4.

Active learning, using mutual information and our average agreement measure,
offers a new approach to dealing with networks where knowledge of vertex
attributes is incomplete and costly. Given that Gibbs sampling is
computationally expensive, however, we do not expect the methods we used here
to scale to truly large networks. An interesting question is whether MI and AA
can be estimated using other means, such as a message-passing algorithm, or
whether there are simple, scalable heuristics for selecting which vertex to
query, based on some notion of betweenness or centrality, with similar
performance.

In addition, the type of block model we use here does not deal well with
sparse networks, or with heterogeneous degree distributions. In particular, it
tends to label high-degree and low-degree vertices as belonging to different
types, with higher and lower values of $p_{ij}$. In future work, we will test
the MI and AA algorithms on a degree-corrected block model, such as in
[13,16], where the degrees of the vertices are part of the input, as opposed
to data that the model is obliged to explain.